Quantization of a large language model

Published

May 23, 2023

This is a test of the AutoGPTQ quantization library. The pip version of the package is not always up to date, so it is better to clone the git repository and run pip install . from inside the cloned repo. Note: Triton is available only on Linux hosts.
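
For reference, an install from source would look roughly like this in a notebook (the repository URL is an assumption based on where AutoGPTQ was hosted at the time of writing):

#!git clone https://github.com/PanQiWei/AutoGPTQ.git
#!cd AutoGPTQ && pip install .[triton]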

#!pip install auto-gptq[triton]
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
pretrained_model_name = "facebook/opt-125m"

The default for desc_act is True.
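
For clarity, the configuration used in the next cell is equivalent to the following sketch with desc_act spelled out; the comments are my own reading of the parameters (desc_act corresponds to GPTQ's activation-order heuristic):

quantize_config = BaseQuantizeConfig(
    bits=4,           # quantize weights to 4 bits
    group_size=128,   # one set of scales/zero-points per group of 128 weights
    desc_act=True,    # the default: GPTQ's activation-order ("act-order") option
)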

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]
examples[0]
{'input_ids': [2, 39545, 12, 571, 3320, 1343, 16, 41, 1365, 12, 560, 12, 3698, 1421, 24934, 1938, 5560, 19, 3018, 12, 6928, 6256, 354, 6, 716, 15, 272, 10311, 1864, 17194, 4], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The following step takes care of the weight quantization and took about 10 minutes. There are no training loops; the examples serve as calibration data that the GPTQ algorithm uses to minimize the error introduced by quantization.

model.quantize(examples)
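
In practice one would pass more than a single calibration sample. A minimal sketch of the same pattern with a larger set (the texts here are placeholders, not the ones used above):

calibration_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Quantization reduces the precision of model weights.",
    # ... more text samples representative of the intended use
]
examples = [tokenizer(text) for text in calibration_texts]
model.quantize(examples)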

VRAM usage increases to about 1100 MiB during the quantization process, probably because of Triton. In principle, quantization should also be possible using CPU RAM only.
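
One way to check this from inside the notebook is PyTorch's own memory counters; note they only cover memory allocated through PyTorch, not everything nvidia-smi reports for the CUDA context:

import torch

# peak memory held by PyTorch tensors since the start of the process, in MiB
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**20:.0f} MiB")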

import os
quantized_model_dir = "opt-125m-4bit-128g"
os.makedirs(quantized_model_dir, exist_ok=True)

The safetensors format is supposed to be safer to load and more memory efficient than the default pickle-based format.

model.save_quantized(quantized_model_dir, use_safetensors=True)
del model
import torch
torch.cuda.empty_cache()

Even after deleting the model and emptying the cache, I don’t see a decrease in the reported VRAM usage.
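
A more thorough cleanup would also run the garbage collector before emptying the cache, and then check what PyTorch still holds (nvidia-smi will still show memory reserved by the CUDA context itself):

import gc
import torch

gc.collect()               # drop Python references before releasing cached blocks
torch.cuda.empty_cache()   # return cached blocks to the driver
print(f"still allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
print(f"still reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MiB")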

The following step takes about 3 minutes, which is considerably slower than other methods I tested; oobabooga's text-generation-webui loads a bigger model in roughly 10 seconds. As the log below shows, most of the time goes into warming up the Triton autotune cache.

# load the quantized model; currently only CPU or a single GPU is supported
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                           device="cuda:0",
                                           use_triton=True,
                                           use_safetensors=True,
                                          )
2023-05-23 15:15:35 WARNING [auto_gptq.modeling._base] use_triton will force moving the whole model to GPU, make sure you have enough VRAM.
2023-05-23 15:15:35 INFO [auto_gptq.modeling._base] lm_head not been quantized, will be ignored when make_quant.
2023-05-23 15:15:35 WARNING [accelerate.utils.modeling] The safetensors archive passed at opt-125m-4bit-128g/gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
2023-05-23 15:15:35 INFO [auto_gptq.nn_modules.qlinear_triton] Found 3 unique KN Linear values.
2023-05-23 15:15:35 INFO [auto_gptq.nn_modules.qlinear_triton] Warming up autotune cache ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [02:59<00:00, 14.96s/it]
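
As an aside, the near three-minute warm-up is specific to the Triton kernels. Since CPU or a single GPU is also supported, a non-Triton load would look roughly like the sketch below, assuming the same keyword arguments apply (this model is not used for the inference that follows):

model_cpu = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cpu",          # or "cuda:0" to use the CUDA kernels instead of Triton
    use_triton=False,
    use_safetensors=True,
)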

As the output below shows, inference does not work properly: the model just repeats the same token.

# inference with model.generate
print(tokenizer.decode(
    model.generate(**tokenizer(
        "auto_gptq is", return_tensors="pt").to("cuda:0"))[0]))
</s>auto_gptq is is is is is is is is is is is is is is
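
Part of the repetition may simply be greedy decoding on a tiny model; sampling parameters can be passed to generate to see whether the output varies (the specific values below are arbitrary):

inputs = tokenizer("auto_gptq is", return_tensors="pt").to("cuda:0")
output_ids = model.generate(
    **inputs,
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=32,
)
print(tokenizer.decode(output_ids[0]))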

Overall, the package can be used to quantize a bigger model from the Hugging Face Hub. Inference quality should improve with model size.
